Expectation-Propagation for Summary-Less, Likelihood- Free 
Inference 

Simon Barthelme (BCCN &i TU Berlin, simon.barthelmeQbccn-berlin.de) 
Nicolas Chopin (CREST-ENSAE, nicolas.chopin^ensae.fr) 



O 



o 
u 

-(— > 



> 

in 
in 

o 



> 

X 



Abstract 

Many models of interest in the natural and social sciences have no closed-form likelihood function, which means that they cannot 
be treated using the usual techniques of statistical inference. In the case where such models can be efficiently simulated, Bayesian 
inference is still possible thanks to the Approximate Bayesian Computation (ABC) algorithm. Although many refinements have 
since been suggested, the technique suffers from three major shortcomings. First, it requires introducing a vector of "summary 
statistics", the choice of which is arbitrary and may lead to strong biases. Second, ABC may be excruciatingly slow due to very low 
acceptance rates. Third, it cannot produce a reliable estimate of the marginal likelihood of the model. 

We introduce a technique that solves the first and the third issues, and considerably alleviates the second. We adapt to the 



likelihood-free context a variational approximation algorithm. Expectation Propagation (Minka 20011. The resulting algorithm is 



shown to be faster by a few orders of magnitude than alternative algorithms, while producing an overall approximation error which 
is typically negligible. Comparisons are performed in three real-world applications which are typical of likelihood-free inference, 
including one application in neuroscience which is novel, and possibly too challenging for standard ABC techniques. 

Key-words: Approximate Bayesian Computation; Expectation Propagation; Likelihood-Free Inference; Quasi-Monte Carlo. 



1 Introduction 

In natural and social sciences, one finds many examples of probabilistic models whose likelihood function is intractable. 



This includes most models of noisy biological neural networks (Gerstner and Kistler 20021, some time series and choice 
models in Economics (Train 2003), phylogenetic models in evolutionary Biology (Beaumont 20101, among others. That 



the likelihood is intractable is unfortunate, because one would still like to perform the usual statistical tasks of parameter 
inference and model comparison, and the traditional statistical tool-kit assumes that the likelihood function is either 
directly available or can be made so by introducing latent variables. This explains that researchers have often had to 
content themselves with semi-quantitative analyses showing that a model could reproduce some aspect of an empirical 
phenomenon for some values of the parameters; for two representative examples from vision science, see |Nuthmann et al.[ 
[20T0l [Brascamp et al.ip006| 



A breakthrough was provided by the work of Pritchard et al. (19991 and Beaumont et al. (20021, in the form of the 
Approximate Bayesian Computation algorithm, which enables Bayesian inference in the likelihood- free context. Assuming 
some model p{y*\9) for the data y* , a prior p{9) over the parameter 6 ^ Q, the following procedure produces exact 
samples from the posterior p{9\y*): 

1. Draw 9 from the prior, 9 ^ p{9). 

2. Draw a dataset y from the model conditional on 9, y\9 ^ p{y\9). 

3. If y = y* then keep 9, otherwise reject. 

The acceptance rate provides an estimate of the evidence (marginal likelihood) p{y*) — J p{9)p{y*\9) d9. Of course, if y 
is continuous, then the procedure is hopeless - we will end up rejecting every value of 6 we draw. Even if y is discrete, 
the acceptance probability may still be ridiculously low. The solution put forward by [Pritchard et al ( |1999 l is to define 
some pseudo-distance (i(y,y*) between real and simulated datasets and accept samples if the distance is smaller than 
some value e. This produces samples from the so-called ABC posterior: 



p,{9\y*)^p{9) / P(y|0)l{d(y,y*)<.}dy. 



(1.1) 

The pseudo-distance is usually taken to be d{y,y*) = ||s(y) — s(y*) ||, for some norm || • ||, where s(y) is a vector 
of summary statistics, for example some empirical quantiles or moments of y. Unless s is sufficient, the approximation 
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error does not vanish as e — > 0, i.e. p{9\s{y*)) ^ p{d\y*)- In that respect, the ABC posterior suffers from two levels of 
approximation: a nonparametric error governed by the bandwidth e (see e.g. 
summary statistics s. 



Blum 2010'), and a bias introduced by the 
The more we include in s (y), the smaller the bias induced by s should be. On the other hand, as 
the dimensionality of s(y) increases, the lower the acceptance rate will be. We would then have to increase e, which leads 
to an approximation of lower quality. 

Thus, ABC requires in practice some more or less arbitrary compromise between what summary statistics to include 
and how to set e. To establish that the results of the inference are somewhat robust to these choices, many runs of the 
algorithm are required. Although several variants of the original ABC algorithm that aim at increasing acceptance rates 
exist (e.g. Beaumont et al. 20021, the current state of the matter is that an ABC analysis is very far from routine use 



because it may take days to tune on real problems. 

Another important limitation of ABC is that introducing a summary statistics s makes it impossible to perform model 



choice in most cases (Robert et al. 2011 ). The normalising constant of ( 1.1 ) approximates p{s{y*)), not p{y), and, except 
in very special cases, the former quantity has no clear interpretation in terms of model choice. 

In this article we introduce EP-ABC, an adaptation of the Expectation Propagation algorithm (Minkal 2001 Bishop 
|2006[ Chap. 10) to the likelihood-free setting. EP-ABC is much faster than previous ABC algorithms: typically, it 
provides accurate results in a few minutes, whereas a standard ABC algorithm need a few hours. EP-ABC incorporates 
one data-point yi at a time, which makes it possible to do away with summary statistics, and constrain all the data-points 
yi to be close to the y*'s, while maintaining a reasonable computational cost. More formally, EP-ABC builds an EP 
approximation of the following type of ABC posterior distributions: 



p,(0iy*)(xp(0)n|yp(y,iz/^,_i,0)ii„^^_^.,[<^idy,| 



(1.2) 



with the convention that p{yi\y*.Q,6) — p(yi\6), and assuming that the datasets y and y* may be decomposed into n 
random variables, yi and y* . When e — >■ 0, this ABC posterior converges to the true posterior, and its normalising constant 
(up to to some simple transformation we describe later) converges to the the true evidence p{y*). 

EP-ABC also suffers from two levels of approximation (EP, then ABC). However, as we show in the cases we study, 
one is better off with a good approximation to a very good ABC approximation to the posterior (that is, corresponding 
to a small value of e, and not based on a summary statistics), than with an exact version of a bad ABC approximation 
(based on a large e and some arbitrary set of summary statistics) . 

We start with a generic description of the EP algorithm, in Section |2] and explain in Section [3] how it can be adapted 
to the likelihood-free setting. We explain in Section |4] how EP-ABC can be made particularly efficient when data-points 
are IID (independent and identically distributed). Section p^ contains three case studies drawn from Finance, population 
ecology, and vision science. The two first examples are borrowed from already known applications of ABC, and illustrate 
to which extent EP-ABC may outperform standard ABC techniques in realistic scenarios. To the best of our knowledge, 
our third example from vision science is a novel application of likelihood- free inference, which, as which shall argue, seems 
too challenging for standard ABC techniques. 

We use the following notations throughout the paper: bold letters refer to vectors or matrices, e.g. 6, A, S and so 
on. We also use bold face to distinguish complete sets of observations, i.e. y or y*, from their components, yi, and y*, 
i — 1, . . .n, although we do not assume that these components are necessarily scalar. For sub-vectors of observations, we 
use the colon notation: yi-^ — (j/i, . . . ,yi). The notation || • || refers to a generic norm, and || • ||2 refers to the Euclidean 
norm. The KuUback-Leibler divergence between probability densities tt and q is denoted as 



KLinWq)^ / ^(0)log 






dO. 



The letter p always refers to the probability densities concerning the model; i.e. p{9) is the prior, p{yi\0) is the likelihood 
of the first observation, and so on. Transpose of a matrix A is denoted A'. 



2 Expectation Propagation 



Expectation Propagation (EP, Minka 20011 is an algorithm for variational inference, a class of techniques that aim at 
finding a tractable probability distribution q{0) that best approximates an intractable target density tt{6). One way to 
formulate this goal is as finding the member of some parametric family Q that is in some sense closest to n (9), where 
"closest" is defined by some divergence between probability distributions. Many variational methods (e.g. Variational 
Bayes, see Chap. 10 of Bishop 2006 1 have as goal the minimisation of the KuUback-Leibler divergence between q and 
TT, KL{q\\TT). This leads to very fast algorithms, but the resulting approximation has some unfortunate properties, for 
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example that of systematically underestimating the variance (Wang and Titterington 2005 1. EP tries instead to minimise 
KL{TT\\q), which may be a more sensible goal from a statistical point of view. 

We begin our discussion of EP with some background material on exponential families and KL divergence. 

2.1 Exponential families, KL divergence and the computation of moments 

In this paper and in most applications of EP, the family Q of potential approximations is a particular natural exponential 
family, with respect to variable G 8 C M*^: 

qx{e)=exp{XH{e)~cf>{X)} (2.1) 

where A G M"^ is the natural parameter, t (6) is a feature map from Q to the feature space M.'^ , and is known variously 



as the partition function or the log-cumulant function. The most interesting property of (j) is the following: 

i(A)= /'t(6>)exp{A*t(6>)-0(A)} d6' = EA{t(6>)} (2.2) 



d 



where E^ denotes an expectation with respect to qx- The derivative of the partition function gives the moment parameters, 
usually denoted by rj. It can be shown that ;^</'(A) is, as a function of A, a smooth invertible map. Thus, one may also 
parametrise qx in terms of r/. 

In this paper, we will take Q to be the set of multivariate Gaussian distributions of dimension d. In natural exponential 
family form, it reads: 

A=(r,Q), qx{e)^exp(-l-e'Qe + r'e] (2.3) 



2 
where Q and r are the natural parameters. The more familiar moment parameters are given by: 

r, = (/x,S), fi = Ex{e) = Q-^r, S = Cov^ (0) - Q-^ 

where Cova denotes the Covariance matrix with respect to qx{0). To approximate a distribution tt with a distribution qx 
in an exponential family, one may minimise KL{'K\\qx) with respect to A, which leads to the equation: 



d\ 

d 
dX 



(A)- f n{9)t{9) de = 



The solution is therefore obtained by "moment matching": the closest member qx to tt is obtained by imposing that 
the moments of t{d) under tt and under qx are identical. In the Gaussian case, the multivariate Gaussian distribution 
Nd (/i., S) closest to TT (as measured by KL divergence) is defined by the equations: 



(1= f 07r(6>) de, s = f{e- (i){e - /x)*7r(6>) de. 



2.2 Assumptions of Expectation Propagation 

EP assumes that the target density 7r(0) decomposes into a product of simpler factors, and exploits this factorisation in 
order to construct a sequence of simpler problems. When 7r(0) is a posterior density, this factorisation typically takes the 
following form: 

n 

TT{e)^p{e\y)^p{e)l[h{e) (2.4) 

2 = 1 

where p{e) is the prior density of parameter e, and the k's correspond to a chain rule decomposition of likelihood p{y\e), 
e.g. li{e) = p{yi\yi;i^i, 6) if y is a set of n observations yi, with the convention that p{yi\yi:Q, e) — p{yi\e). 
EP uses an approximating distribution with a similar structure: 

n 

q{e)^X{fm (2.5) 

i=0 
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where the fi's are known as the "sites". To obtain a Gaussian approximation, one takes fi (6) — exp {^—\O^QiO + r*0), so 
that: 



q{e)^e^A-\e'\Y,QAe+\y^vA e 



(2.6) 



where Qi and r^ are called the site parameters. For the sake of simplicity, we assume from now on that the true prior p{9) 
is Gaussian, with natural parameters Qq and Tq. In that case, the site /o is kept equal to the prior, and only the sites /i 
to /„ need to be updated. Note however that EP may accommodate more general priors. 

In spirit, EP is close to an coordinate-descent optimization algorithm: the sites are updated one by one, in order to 
progressively minimise the pseudo-distance KL{Ti\\q). The next section describes the site update. 

2.3 Site update 



Suppose that ( |2.5| is the current approximation, and one wishes to update site i. This is done by creating a "hybrid" 
distribution, obtained by substituting site i with the true likelihood contribution li{0): 

h{e)^q^,{e)k{e), q-^{e) = \{},{e). (2.7) 

For Gaussian sites, this leads to: 

This hybrid distribution may be interpreted as a posterior distribution, based on a Gaussian prior, with natural 
parameters Q_i and r_i, and the likelihood of one data-point. The idea is that, by getting closer to the hybrid we 
should also get closer to the actual density. To find the KuUback-Leibler projection of the hybrid onto Q, we compute its 
moments, as explained in Section [2. 1[ In the Gaussian case, this gives: 

Zh = fq-.de) hie) de, 

fj-h = ^ j eq-,{e)k{e)de, 

Sh = ^ j ee*q-,{e)k{e)de-tihtii. (2.8) 

Many models, e.g. Gaussian mixtures, or generalised linear models, are such that these moments admit a closed-form 
expression, or at least simplify to one-dimensional integrals, which are easy to compute numerically. For our purpose, 
suffice it to say that the applicability of EP is directly determined by the tractability of the integrals above, that is, how 
easy it is to compute the two first moments of a pseudo-posterior, consisting of a Gaussian pseudo-prior q_i and one 
likelihood factor U. 

Once the moments have been computed, the final step is simply to update the approximation so that it has the same 
moments as the hybrid, completing the KuUback-Leibler projection. This is done by updating the site parameters for site 
i: \i -i— Aft — \ i, with A/, = \{rih), and 77/1 is the moment parameters obtained from the hybrid. In the Gaussian case, 
this gives: 

EP proceeds by looping over sites, updating each one in turn until convergence is achieved. The more general EP 
algorithm for a generic exponential family is described as Algorithm [l] For the particular case where the exponential 
family is the family of Gaussian distributions of dimension d, as described above, one simply takes A — {r,Q), and 
T] = {fi, S). In particular. Step 2 of Algorithm fll corresponds exactly to computing the moments in (2.8 1. 



2.4 Approximation of evidence 



EP also provides an approximation of the normalising constant of (2.4 1, that is, the evidence p(y). This is based on 
the same ideas of updating site approximations through moment matching. We rewrite in a normalised form the target 
distribution 



3 EP-ABC: Adapting EP to likelihood-free settings 



Algorithm 1 Generic EP for exponential families. 



Input: a target density tt (6) = p (6) Y["=i h (S)- 

Initialise Aq to the exponential parameters of the prior po, and local site parameters Ai . . . A„ = 0. Set global approximation 

parameter A to A = X]i=o ■^i = ■^o- Loop over sites i = 1, . . . , n until convergence: 

1. Create hybrid distribution h{0) ex q^i (9) li{9) by setting q^i (9) ex exp (A*_jt (0)) with \ i = A — A,. 

2. Compute moments ?7^ of hybrid distribution, transform to natural parameters X^ = A(t7^). 

3. Update site i by setting A^ = A^ — X i, then reset global parameter A to A = A/j. 
Return moment parameters f] = t] (A). 



p{y) 



and the approximating distribution (2.51 



qio)=p{e)f[^, MO) = e^p(~\o'Q^o + rle\ (2.10) 

i—l * ^ ^ 

where again for simplicity, we have assumed that the true prior is Gaussian, and needs not be approximated in the course 
of the algorithm. 

As above, the update of site i adjusts Ci, Ti and Qi so as to minimise the KuUback-Leibler divergence between the 



Gaussian approximation q and a normalised version of the hybrid distribution. Simple calculations (see e.g. Seeger 20051 
lead to the following expressions for the update of Ci'. 

log (Q) = log (Z,0 - *(r, Q) + *(r_„ Q_,) (2.11) 

where Zh is the normalising constant of the hybrid, as defined in (2.8), r, Q (resp. r_i, Q-i) are the natural parameters 
of the current Gaussian approximation q (resp. of q/ fi ex Hiati/j) ''^^'^ ^('"iQ) is the log-normalising constant of an 
unnormalised Gaussian density: 

vl/(r, Q) = log Ij exp f-^x^Qe + r*©") d9\ = -]^ log |Q/2^| + ^r'Qr. 



For each site update, one calculates Ci as defined in (2.111. Then, at the end of the algorithm, one may return the 
following quantity 

n 

J2^og{Ci} + ^{r,Q)-^{ro,Qo) 

i=l 

as an approximation to the logarithm of the evidence. 

3 EP-ABC: Adapting EP to likelihood-free settings 
3.1 Basic principle 

As explained in the introduction, our objective is to approximate the following ABC posterior 

p,(0|y*)(xp(0)J]|yp(y,|2/*,;_i,0)l|ll^^_,^.|l<^|d2/,| (3.1) 

which corresponds to a particular factorisation of the hkehhood, 

n 

p{y*\9) = l[p{y*\yl.-i,d)- (3-2) 

Note that, in full generality, the y* may be any type of "chunk" of the observation vector y* , i.e. the random variables 
y* may have a different dimension, or more generally different supports. For simplicity, we assume that the prior p{9) is 
Gaussian, with natural parameters Qq and Tq. 



3 EP-ABC: Adapting EP to likelihood-free settings 



Algorithm 2 Computing the moments of the hybrid distribution in the hkehhood-free setting, basic algorithm. 
Inputs: e, y* , i, and the moment parameters /x_i, I]_i of the Gaussian pseudo-prior q^i. 

1. Draw M variates 0['"1 from a Af(/x_i,S_i) distribution. 

2. For each ©["l, draw yj"' - p(y,,|2/*,_i, ^H). 

3. Compute the empirical moments 



M 



m=l *■ ' 'J 

E™=i^'"''{e'"'}'i 



E™=ie["ii 



{lly''"'-a*lt<4 



A4c 



^i -a 



me] 



Mn 



(J'hfJ'h- 



(3.3) 



(3.4) 



Return Zh = Macc/M, fih and S^.. 



112/- 



A standard ABC algorithm is not well suited to approximate (3.1 1, because the prior probability that the n constraints 
~ ^i^ll < e hold simultaneously is exponentially small with respect to n. Yet, this ABC posterior is appealing precisely 



for the same reason: it forces all the components of y to be close to those of y*, and does not rely on an arbitrary choice 
of summary statistics. 



One may interpret (3.11 as an artificial posterior distribution, which decomposes into a prior times n likelihood 
contributions li, as in (2.4), with 



h{e) = \ p{yU:r-l,0)t 



{ii'y.-tf.rii<4 



dVi 



We have seen that the feasibility of the EP algorithm is determined by the tractability of the following operation: to 
compute the two first moments of a pseudo-posterior, made of a Gaussian prior N^ifJ-^i, 5]_i), times a single likelihood 
contribution li{9). This immediately suggests the following EP-ABC algorithm. We use the EP algorithm, as described in 
Algorithm[l] and where the moments of such a pseudo-posterior are computed as described in Algorithm[2] that is, as Monte 
Carlo estimates, based on simulated pairs (0["l,yj'"l), where ©["1 - Nd{^l-^,'E_i), and y[™l|0["l - p(y,|y*^_^, 0["1). 

Since EP-ABC integrates one data-point at a time, it does not suffer from a curse of dimensionality with respect to n: 
the rejection rate of Algorithm [2] corresponds to a single constraint ||?/i — 2/,*|| < e, not n of them, and is therefore likely to 
be tolerably small even for small windows e. 



The only requirement of EP-ABC is that the factorisation of the likelihood, ( 3.2 1 , is chosen in such a way that simulating 
from the model, i.e. y ^ p(y\9) can be decomposed into a sequence of steps, where one samples from p(yi|2/i:i-i, 0), for 
i = 1, . . . , n. We shall see in our examples section, see Section [5] that several important applications of likelihood- free 
inference fulfil this requirement in a trivial way. We shall also discuss in Section [6] how other likelihood-free situations 
may be accommodated by the EP-ABC approach. 



3.2 Numerical stability 

EP-ABC is a stochastic version of EP, a deterministic algorithm, hence some care must be taken to ensure numerical 
stability. We describe here three strategies towards this aim. 

First, to ensure that the stochastic error introduced by each site update does not vary too much in the course of the 
algorithm, we adapt dynamically M, the number of simulated points, as follows. For a given site update, we sample 



repetitively Mq pairs (6 

threshold M,„in. Then we compute the moments (|3.3|) and (|3.4 



,yi 



), as described in Algorithm |2 



until the total number of accepted points exceeds some 
based on all the accepted pairs. 
Second, EP-ABC computes a great deal of Monte Carlo estimates, based on IID (independent and identically distrib- 
uted) samples, part of which are Gaussian. Thus, it seems worthwhile to implement variance reduction techniques that are 
specific to the Gaussian distribution. After some investigation, we recommend the following quasi-Monte Carlo approach. 
We generate a Halton sequence ^I™' of dimension d, which is a low discrepancy sequence in [0, 1]**, and take 



5)1' 



/^ 



i_,$ 



-1 Ubn]\ 



L ,U 



4 Speeding up EP-ABC in the IID case 



where /i_i, S_i are the moment parameters corresponding to the natural parameters r_i, Q^i, L^i is the Cholesky lower 
triangle of S-i, and $^^ returns a vector that contains the A^(0, 1) inverse distribution function of each component of 
the input vector. We recall briefly that a low discrepancy sequence in [0,1]'* is a deterministic sequence that spreads 
more evenly over the hyper-cube [0, l]"* than a sample from the Uniform distribution would; we refer the readers to e.g. 



Chap. 3 of Gentle (20031 for a definition of Halton and other low discrepancy sequences, and the theory of quasi-Monte 
Carlo. Rigorously speaking, this quasi-Monte Carlo version of EP-ABC is a hybrid between Monte Carlo and quasi-Monte 
Carlo, because the y]™ are still generated using standard Monte Carlo. However, we do observe a dramatic improvement 
when using this quasi-Monte Carlo approach. An additional advantage is that one may save some computational time by 
generating once and for all a very large sequence of $^^ ^^l^lj vectors, and store it in memory for all subsequent runs of 
EP-ABC. 

The third measure we may take is to slow down the progression of the algorithm such as to increase stability, by 
conservatively updating the parameters of the approximation in Step 3 of Algorithm [l| that is, Xi ^r- a{Xh—X.-i) + {l—a)Xi. 



Standard EP is the special case with a ~ 1. Updates of this type are suggested in Minka (20041. 

In our experiments, we found that the two first strategies improve performance very significantly, and that the third 
strategy is sometimes useful, for example in our reaction time example, see Section [5. 4[ 

3.3 Evidence approximation 



In this section, we normalise the ABC posterior (3.11 as follows: 



Me\y*) = ^^PWn|/p(yd2/^.^i,g) ^^''';^(';)''^'^ %} , (3.5) 

where Wi(e) is the normalising constant of the Uniform distribution with respect to the ball of centre y*, radius e, and 
norm || • ||. For the Euclidean norm, and assuming that the i/i's have the same dimension dy, one has: Wi(e) — ti,;(l)e'*'', 
wit h Vi{l)= Tr'^y/^/T (dy/2 + 1); e.g. f,(l) = 2 if dj^ = 1, v,{l) ^ tt if dy ^ 2. 

Wilkinson (|2008[) shows that a standard ABC posterior such as (1.1) can be interpreted as the posterior distribution 



of a new model, where the summary statistics are corrupted with a uniformly-distributed noise (assuming these summary 
statistics are sufficient). The expression above indicates that this interpretation also holds for this type of ABC posterior, 
except that the artificial model is now such all the random variables j/i are corrupted with noise (conditional on y^.j_]^). 



The expression above also raises an important point regarding the approximation of the evidence. In (3.5), the norm- 
alising constant Pe{y*) is the evidence of the corrupted model, which converges to the evidence p{y*) of the actual model 
as e — >■ 0. On the other hand, EP-ABC targets (|3.1|, and, in particular, see Section 2.4 produces an EP approximation 



of its normalising constant, which is Pe{y*) YVi=i "i(^)- Thus, one needs to divide this EP approximation by Y[i=i "^ii^) 
in order recover an approximation of p^(y*). We found in our simulations that, when properly normalised as we have 
just described, the approximation of the evidence provided by EP-ABC is particularly accurate, see Section [5] In con- 
trast, standard ABC based on summary statistics cannot provide an approximation of the evidence, as explained in the 
Introduction. 

4 Speeding up EP-ABC in the IID case 

Typically, the main computational bottleneck of EP-ABC, or other types of ABC algorithms, is simulating data-points 
from the model. In this section, we explain how these simulations may be recycled throughout the iterations in the IID 
(independent and identically distributed) case, so as to significantly reduce the overall computational cost of EP-ABC. 

Our recycling scheme is based on a a straightforward importance sampling strategy. Consider an IID model, with 
likelihood p{y*\9) = Y[i=iEiyiW)- Assume that, for a certain site i, pairs {9'-"^\y'-"^') are generated from q^i{6)p{y\0), 
as described in Algorithm pi We have removed the subscript i in both y[™l and p{y\d), to highlight the fact that the 
generative process of the data-points is the same for all the sites. The next update, for site i + 1, requires computing 
moments with respect to (7_(i_|_i)(0)p(y|0)l {||y — ?;*_^i|| < e}. Thus, we may recycle the simulations of the previous site 
by assigning to each pair (0['"1, j/'™!) the importance sampling weight: 



i^^^Si^ -{«»"-««-) 



(6>M) 

and compute the corresponding weighted averages. 

Obviously, this step may also be applied to the subsequent sites, i + 2,i + 3, . . ., until one reaches a stage when 
the weighted sample is too degenerated. When this happens, "fresh" simulations may be generated from the current 



5 Case studies 



Algorithm 3 Computing the inoments of the hybrid distribution in the hkehhood-free setting, recychng scheme for IID 

models. 

Inputs: i, e, current weighted sample (0l'"l, y'^Om^ii moment parameters p, and S (resp. fi^i and S_i) that correspond to 
the site where data were re-generated for the last time (resp. that correspond to the Gaussian approximation Yii^t Qji^))- 

1. Compute the importance sampling weights 

7V(0M;/i,S) W'-''-yt\\<^} 

where N{9: fi, S) stands for the Gaussian N{fjL, S) probability density evaluated at 9, and the effective sample size: 



ESS 






E:Li,(-rO^ 



2. If ESS < ESSmin, replace (6»[™l,y["l)*f^i by M IID draws from q^,{9)p{y\e), set wj""' = ^\\\yi"^i-y*\\<e}-- ^nd 
p, = Hi, E = E_i. 

3. Compute the following importance sampling estimates: 



1 ^^ 



E^Li^i^'e""! 



m=l ^'J 

and 

•^/i — ^ M/iM/i- 

Return (gt"'',;/'"')^'^!, A, S, Z^, fih and S;,. 

site. Algorithm [3] describes more precisely this recycling strategy. To detect weight degeneracy, we use the standard ESS 



(Effective Sample Size) criterion of Kong et al. (19941: we regenerate when the ESS is smaller that some threshold ESSn 
The slower the EP approximation evolves, the less often regenerating the data-points is necessary, so that as the 

approximation gradually stabilises, we do not need to draw any new samples any more. Since EP slows down rapidly 

during the first two passes, most of the computational effort will be devoted to the early phase, and additional passes 

through the data will come essentially for free. 

In non-IID cases several options are still available. For some models the data may come in blocks, each block made up 

of IID data-points (think for example of a linear model with discrete predictors) . We can apply the strategy outlined above 



in a block-wise manner (see the reaction times example, section 5.4 1. In other models there may be an easy transformation 
of the samples for data-point i such that they become samples for data-point j ^ i, or one may be able to reuse part of 
the simulation. 

5 Case studies 

5.1 General methodology 

In each scenario, we apply the following approach. In a first step, we run the EP-ABC algorithm. We may run the 
algorithm several times, to evaluate the Monte Carlo variability of the output, and we may also run for different values of 
e, in order to assess the sensitivity to this approximation parameter. 

In a second step, we run alternative algorithms, that is, either an exact (but typically expensive) MCMC algorithm, or 
an ABC algorithm, based on some set of summary statistics. The ABC algorithm we implement is a Gaussian random walk 
version of the MCMC- ABC algorithm of Marjoram et al. (2003|). This algorithm targets a standard ABC approximation. 



i.e. (1.1 1, that corresponds to a single constraint {||s(y) — s{y*)\\ < e}, for some vector of summary statistics s, and 
some e; the specific choices of s and e are discussed for each application. We calibrate the tuning parameters of these 
MCMC algorithms using the information provided by the first step: we use as a starting point for the MCMC chain the 
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expectation of the approximated posterior distribution provided by the EP-ABC algorithm, random walk scales are taken 
to be some fraction of the square root of the approximated posterior variances, and so on. This makes our comparisons 
particularly unfavourable to EP-ABC. Despite this, we find consistently that the EP-ABC algorithm is faster by several 
orders of magnitude, and leads to smaller overall approximation errors. We report computational loads both in terms of 
CPU time (e.g. 30 seconds) and in terms of the number of simulations of replicate data-points yi. The latter should be 
typically the bottleneck of the computation. 

All the computations were performed on a standard desktop PC in Matlab; programs are available from the first 
author's web page. 



5.2 First example: Alpha-stable Models 

Alpha-stable distributions are useful in areas (e.g. Finance) concerned with noise terms that may be skewed, may have 
heavy tails and an infinite variance. A univariate alpha-stable distribution does not admit a close-form expression for its 
density, buy may be specified through its characteristic function 



$x(i) 



iSt 



a^ 1 
a = 1 



- 7" \tr {l + i/3 (tan ^) sgn(t) {\^t\'-'' - l) }" 
7|i|{l + */3fsgn(t)log|7t|}] 

where a determines the tails, < a < 2, /3 determines skewness, — 1 < /3 < 1, and 7 > and S are respectively scale and 
location parameters; see Nolan (2012 Chap. 1) for a general introduction to stable distributions. 

Peters et al. ( 2010 1 consider a model of n i.i.d. observations yi, i = 1, . . . ,n from a univariate alpha-stable distribution. 

Likelihood- free inference is appealing in this context, 
the algorithm of 



and propose to use the ABC approach to infer the parameters, 
because sampling from an alpha-stable distribution is fast (using e.g. 
computing its density is cumbersome. 



Chambers et al. 19761, while 



Trying EP-ABC on this example is particularly interesting for the following reasons: (a) Peters et al. (2010) show that 



choosing a reasonable set of summary statistics for this problem is difficult, and that several natural choices lead to strong 
biases; and (b) since alpha-stable distributions are very heavy-tailed, the posterior distribution may be heavy-tailed as 
well, which seems a challenging problem for a method based on a Gaussian approximation such as EP-ABC. 

Our data consist of n = 1264 rescaled log-returns, yt = 100 * \og{zt/zt-i), computed from daily exchange rates zt of 
AUD (Australian Dollar) recorded in GBP (British Pound) between 1 January 2005 and 1 December 2010. (These data are 
publicly available on the Bank of England's web-site.) We take = ($~^(a;/2), $~^ ((/3 + l)/2) , log7, 5) where $ is the 
A^(0, 1) cumulative distribution function, and we set the prior to N (O4, diag(l, 1, 10, 10)). Note however that our results are 
expressed in terms of the initial parametrisation a, /3, 7 and 6; i.e. for each parameter we report the approximate marginal 
posterior distribution obtained through the appropriate variable transform of the Gaussian approximation produced by 
EP-ABC. We run the EP-ABC algorithm (recycling version, as model is IID, see Section [4|, with e = 0.1, M = 8 x 10^, 
ESSniin — 2 X 10^, and || • || set to the Euclidian norm in R (i.e. the n constraints in (3.11 simplify to \yi — y*\ < e). 



Variations over ten runs are negligible. Average CPU time for one run is 39 minutes, and average number of simulated 
data-points over the course of the algorithm, is 4 x 10^. 

We first compare these results with the output of an exact random-walk Hastings-Metropolis algorithm, which relies on 
the evaluation of an alpha-stable probability density function for each data-points (using numerical integration) . Because 
of this, this algorithm is very expensive. We ran the exact algorithm for about 60 hours (2 x 10^ iterations). One sees in 
Figure [STT] that the difference between EP-ABC and the exact algorithm are negligible. Results from the exact algorithm 
must be taken with some caution, as more iterations would have been required according to standard MCMC convergence 
checks; autocorrelation time is higher than 100 in certain dimensions. 

We then compare these res ults with those obta ined by MCMC- ABC, for the set of summary statistics which performs 
best among those discussed by Peters et al. (2010 see Si in Section 3.1). We run 2 x 10^ iterations of this sampler, which 



leads to about 50 times more simulations from an univariate alpha-stable distribution than in the EP-ABC runs above. 
Through pilot runs, we decided to set e = 0.03, which seems to be as small as possible, subject to having a reasonable 
acceptance rate (2 x 10"'^) for this computational budget. In Figure 5.1 we see that the posterior output from this 
MCMC- ABC exercise is much more biased than EP-ABC. As explained in the previous section, we have set the starting 
point of the MCMC- ABC chain to the posterior mode. If initialised from some other point, the sampler typically takes a 
much longer time to reach convergence, because the acceptance rate is significantly lower in regions far from the posterior 
mode. 

Finally, we also use EP-ABC, with the same settings as above, e.g. e = 0.1, in order to approximate the evidence 
of the model above (—1385.8) and two alternative models, namely a symmetric alpha-stable model, where /? is set to 
(—1383.8), and a Student model (—1383.6), with 3 parameters (scale 7, position 6, degrees of freedom v, and a Gaussian 
prior iV(03, diag(10, 10, 10)) for 9 = (log i/, 7, 6)). (Standard deviation over repeated runs is below 0.1.) One sees that there 
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Fig. 5.1: Marginal posterior distributions of a, /?, 7 and 8 for alpha-stable model: MCMC output from the exact algorithm 
(histograms), approximate posteriors provided by first run of EP-ABC (solid line), kernel density estimates 
computed from MCMC- ABC sample based on summary statistic proposed by Peters et al. (20101 (dashed line). 
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Fig. 5.2: Lokta-Volterra example: simulated dataset 



no strong evidence of skewness in the data, and that the Student distribution and a symmetric alpha-stable distribution 
seem to fit equally well the data. We obtained the same value (—1383.6) for the evidence of the Student model when 
using the generalised harmonic mean estimator ( Gelfand and Dey] 1994) based on a very long chain of an exact MCMC 
algorithm. For the two alpha-stable models, this approach proved to be too expensive to allow for a reliable comparison. 



5.3 Second example: Lotka-Volterra models 

The stochastic Lotka-Volterra process describes the evolution of two species Y\ (prey) and Y^ (predator) through the 
reaction equations: 

Yi ^ 2Yi 
yi+Fa ^ 21^2 
Y2 ^ 

This chemical notation means that, in an interval [t, t+dt], the probability that one prey is replaced by two preys is ridt, and 
so on. Typically, the observed data y* = (j/i, . . . , y„) are made of n vectors y* — {Vi 1,11*2) ™ N^, which correspond to the 
population levels at integer times. We take = (Iogri,logr2,logr3). This model is Markov, p(2/*|y*.j_j^, 0) = p{y*\y*_^,9) 
for i > 1, and one can efficiently simulate from p{y*\y*_j^,9) using Gillespie (1977rs algorithm. On the other hand, the 
density p{y*\y*_i, 6) is intractable. This makes this model a clear candidate both for ABC, as noted by Toni et al. ( 2009 1 , 



and for EP-ABC. Boys et al. (2008 1 show that MCMC remains feasible for this model, but in certain scenarios the proposed 



schemes are particularly inefficient, as noted also by Holenstein^ (2009i Chap. 4). 

Following the aforementioned papers, we consider a simulated dataset, corresponding to rates ri 
r^ = 0.3, initial population values y^ i = 20, j/q 2 = 30 and n 



5.2 



0.4, r2 = 0.01, 
Since the observed data are integer- 



50; see Figure 

valued, we use the supremum norm in (3.1 ), and an integer- valued e; this is equivalent to imposing simultaneously the 2n 
constraints \yi^i — y*i\ < e and \yi.2 ~ J/* 2 1 — ^ i^^ the ABC posterior. 

First, we run EP-ABC (standard version) with Afmin — 4000, and for both e = 3 and e = 1. We find that a single 
pass over the data is sufficient to reach convergence. For e = 3 (resp. e = 1), CPU time for each run is 2.5 minutes (resp. 
25 minutes), and number of simulated transitions p{yi\y*_^,9) is about 10^ (resp. 9 x 10^); marginal posteriors obtained 
through EP-ABC are reported in Figure |5.3| 



When applying ABC to this model, Toni et al. (20091 uses as a pseudo-distance between the actual data y* and 
the simulated data y the sum of of squared errors. In Wilkinson (2008 1 's perspective discussed in Section l4J this is 
equivalent to considering a state-space model where the latent process is the Lokta-Volterra process described above, 
and the observation process is the same process, but corrupted with Gaussian noise. Thus, instead of a standard ABC 
algorithm, one may use a MCMC sampler specifically designed for state-space models in order to simulate from the ABC 



approximation of Toni et al. (2009). Following Holenstein (2009 Chap. 4), we consider a Gaussian random walk version 
of the marginal PMCMC sampler. This algorithm is a particular instance of the state of the art PMCMC framework 



of Andrieu et al. (20101, which is based on the idea of approximating the marginal likelihood of the data by running a 



particle filter of size N at each iteration of the MCMC sampler. 
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Fig. 5.3: Lokta-Volterra example: marginal posterior densities of rates ri, r2, ra, obtained from PMCMC algorithm 
(histograms), and from ABC-EP, for e = 3 (top) and e = 1 (bottom); PMCMC results for e = 1 could not be 
obtained in a reasonable time. The solid lines correspond to the average over 10 runs of the moment parameters 
of the Gaussian approximation, while the dashed lines correspond to the 10 different runs. 



In Figure 5.3 we report the posterior output obtained from this sampler, run for about 2 x 10^ iterations and N = 1000 
particles (2 days in CPU time, 10^° simulated transitions p{y.i\y*_^,6)), with random walk scales set to obtain a 0.25 
acceptance rate. These plots correspond to e = 3, and a state-space model with an uniformly distributed observation 
noise. In Figure 5.3 one detects practically no difference between PMCMC and EP-ABC with e = 3 (black lines), although 



the CPU time of the latter was about 1500 smaller. 

The difference between the two EP-ABC approximations (corresponding to e = 1 and e = 3) is a bit more noticeable. 
Presumably, the EP-ABC approximation corresponding to e = 1 is slightly closer to the true posterior. We did not manage 
however to obtain reliable results from our PMCMC sampler and e = 1 in a reasonable time. 



5.4 Third example: Race models of reaction times 



Reaction time models seek to describe the decision behaviour of (human or animal) subjects in a choice task (Luce 1991 
Meyer at al. , 1988 Ratcliff 1978). In the typical experiment, subjects view a stimulus, and must choose an appropriate 



response. For example, the stimulus might be a set of moving points, and the subject must decide whether the points 
move to the left or to the right. 

Assuming that the subject may choose between k alternatives, one observes independent pairs, jji = {di,ri), where 
di € {1, . . . , fc} is the chosen alternative, and r^ > is the measured reaction time. For convenience, we drop for now the 
index i in order to describe the random distribution of the pair {d, r). 

Reaction time models assume that the brain processes information progressively, and that a decision is reached when 
a sufficient amount of information has been accumulated. In the model we use here (a variant of typical models found 
in e.g. Ratcliff and McKoon 2008 Bogacz et al. 2007 1 k parallel integrators represent the evidence ei (i) , . . . , e^ (t) in 



favour of each of the k alternatives. The model is illustrated on Figure 5.4 The first accumulator to reach its boundary b 



wins the race and determines which response the subject will make. Each accumulator undergoes a Wiener process with 
drift: 



TdCjit) 



mj dt 



dWi 



where the nij^s are the drift parameters, the Wl''s are k independent Wiener processes; and r is a fixed time scale, 
r — 5ms. The measured reaction time is corrupted by a uniformly-distributed noise r„d, representing the "non-decisional 
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time" (RatclifF and McKoon 20081, i.e. the time the subject needs to execute the decision (prepare a motor command, 



press an answer key, ...). This model is summarised by the following equations: 

r = rd + Tnd, r„d~U[a, 6], 
7'fi = mininf {f : ej{t) = &,}, 

d — argmininf {i : e,(t) — bA . 



(We fix a and 6 to a = lOOms, b — 200ms, credible values from RatclifF and McKoon (20081) 

The model above captures the essential ideas of reaction time modelling, but it remains too basic for experimental 
data. We now describe several important extensions. First, a better fit is obtained if the boundaries are allowed to vary 
randomly from trial to trial (as in RatclifF^ ^1978) : we assume that bj = Cj + r, where r ~ N{0, e*), and s is a parameter 
to be estimated. Second, a mechanism is needed to ensure that the reaction times cannot be too large: we assume that 
if no boundary has been reached after 1 second, information accumulation stops and the highest accumulator determines 



the decision. Finally, one needs to account for lapses (Wichmann and Hill^ 2001J: on certain trials, subjects simply fail 



to pay attention to the stimuli and respond more or less at random. We account for this phenomenon by having a lapse 
probability of 5%. In case a lapse occurs, r^ becomes uniformly distributed between and 800 ms and the response is 
chosen between the alternatives with equal probability. Clearly, this generalised model remains amenable to simulation, 
although the corresponding likelihood is intractable. 

We apply this model to data from an unpublished experiment by M. Maertens (personal communication to Simon 
Barthelme). The dataset is made of 1860 observations, obtained from a single human subject, which had to choose 
between fc = 2 alternatives: "signal absent" (no light increment was presented), or "signal present" (a light increment 
was presented), under 15 different experimental conditions: 3 different locations on the screen, and 5 different contrast 
values. Following common practice in this field, trials with very high or very low reaction times (top and bottom 5%) were 
excluded from the dataset, because they have a high chance of being outliers (fast guesses, keyboard errors or inattention). 
The data are shown on Figure |5.5[ 

From the description above, one sees that five parameters, (m-i, TO2, ci, C2, s), are required to describe the random 
behaviour of a single pair (ri,di), when fc = 2. To account for the heterogeneity introduced by the varying experimental 
conditions, we assume that the 2 accumulation rates, rrii, ?7i2 vary across the 15 experimental conditions, while the 
3 parameters related to the boundaries, ci, C2 and s, are shared across conditions. The parameter 6 is therefore 33- 
dimensional. 

We note that this model would present a challenge for inference even if the likelihood function was available. It is 
difficult to assign priors to the parameters, because they do not have a clear physical interpretation, and available prior 
knowledge (e.g., that reaction times will normally be less than 1 second) does not map easily unto them. Moreover, the 
model is subject in certain cases to weak identifiability problems. For instance, if one response dominates the dataset, 
there is little information available beyond the fact that one drift rate is much higher than the other (or one threshold 
much lower than the other, or both). 

We re-parametrised the positive parameters ci, C2 as ci = e^, C2 = e^+ , and assigned a Af {0, 1) prior to A, 6, and s. 
Taking a A^(0, 5^) prior for these 3 quantities led to similar results. Some experimentation suggested that the drift rates 
could be constrained to lie between -0.1 and 0.1, because values outside of this interval seem to yield improbable reaction 
times (too short or too long). We assigned a [—0.1,0.1] uniform prior for the 30 drift rates, and applied the appropriate 
transform, i.e. x — > 0.1 -|- 5^~^{x), in order to obtain a N{0, 1) prior for the transformed parameters. 

After a few unsuccessful attempts, we believe that this application is out of reach of normal ABC techniques. The 
main difficulty is the choice of the summary statistics. For instance, if one takes basic summaries (e.g. quartiles) of 
the distribution of reactions times, under each of the 15 experimental conditions, one ends with a very large vector 
s of summary statistics. Due to the inherent curse of dimensionality of standard ABC (Blum 2010[), sampling enough 



datasets, of size 1860, which are sufficiently close (in terms of the pseudo-distance \\s{y) — s{y*)\\) would require enormous 
computational effort. Obviously, taking a much smaller set of summary statistics would on the other land lead to too poor 
an approximation. 

Some adjustments are needed for ABC-EP to work on this problem. First, notice that within a condition the datapoints 
are IID so that the posterior distribution factorises over IID "blocks". We can therefore employ the recycling technique 
described in Section |4] to save simulation time, by going through the likelihood sites block- by-block. Second, since data- 
points take values in {1, 2} xR+, we adopt the following set of ABC constraints: 1 {di = d*} 1 {jlogri — logr*| < e}, where 
y* = {d* , r*) denotes as usual the actual datapoints. Third, we apply the following two variance-reduction techniques. One 
stems from the fact that each site likelihood does not depend on all the 33 parameters but on a subset of size 5. In that case, 
using simple linear algebra, one can see that is possible to update only the marginal distribution of the EP approximation 
with respect to these 5 parameters; see the Appendix for details. The second is a simple Rao-Blackwellsiation scheme 
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Fig. 5.4: A model of reaction time in a choice task. The subject has to choose between k responses (here, "Signal present" 
and "Signal absent") and information about each accumulates over time in the form of evidence in favour of one 
and the other. Because of noise in the neural system "evidence" follows a random walk over time. A decision is 
reached when the evidence for one option reaches a threshold (dashed lines) . The decision time in this example is 
denoted by the star: here the subject decides for 'B' after about 150ms. The fact that the thresholds are different 
for "Signal Present" and "Signal Absent" capture decisional bias: in general, for the same level of information, 
the subject favours option "Signal Absent". 
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(a) Probability of answering "Signal present" as a function of relative target contrast in a detection experiment, at three 
different positions of the target (data from one subject). Filled dots represent raw data, the grey curves are the result 
of fitting a binomial GLM with Cauchit link function. The light grey band is a 95% confidence band. As expected 
in such experiments, the probability of detecting the target increases with relative target contrast. 
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(b) Reaction time distributions conditional on target contrast, target position, and response. The semi-transparent dots 
represent reaction times for individual trials. Horizontal jitter was added to aid visualisation. The lines represent 
linear quantile regressions for the 30%, 50% and 70% quantiles. 



Fig. 5.5: Choice (a) and reaction time (b) data in a detection experiment by Maertens. 
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Fig. 5.6: Probability of answering "Signal Present" as a function of contrast: data versus (approximate) posterior predictive 
distribution. We sampled the posterior predictive distribution and summarised it using a binomial GLM with 



Cauchit link, in the same way as we summarised the data (see (a) in Figure 5.5 1. The data are represented as a 
continuous line, the predictive distribution as dotted. The posterior predictive distribution is very close to the 
data. The shaded area corresponds to a 95% confidence interval for the fit to the data. 



that uses the fact that the non-decisional component Vnd is uniformly distributed, and may therefore be marginalised out 
when computing the EP update. 

We report below the results obtained from ABC-EP with e = 0.16, A/ = 3 x lO'^, a ~ 0.4 (see end of Section 3.2), 
and 3 complete passes over the data; CPU time was about 40 minutes. Results for smaller values of e, e.g. e = 0.1, were 
mostly similar, but required a larger CPU time. 

Since we could not compare the results to those of a standard ABC algorithm, we asses the results through posterior 
predictive checking. For each of the 15 experimental conditions, we generate 5,000 samples from the predictive density, 
and compare the simulated data with the real, as follows. 

The marginal distribution of responses can be summarised by regressing the probability of response on stimulus 



contrast, separately for each stimulus position (as on Figure 5.51, and using a binomial generalized linear model (with 
Cauchit link function). Figure 5.6 compares data and simulations, and shows that the predictive distribution successfully 
captures the marginal distribution of responses. 

To characterise the distribution of reaction times, we look at means and inter-quantile intervals, conditional on the 
response and the experimental conditions. The results are presented on Figure |5.7[ The predictive distributions capture 
the location and scale of the reaction time distributions quite well, at least for those conditions with enough data. Such 
results seem to indicate that, at the very least, the ABC-EP approximate posterior corresponds to a high-probability 
region of the true posterior. 



6 Discussion 

In its current presentation, EP-ABC seems to have two limitations. First, it assumes a Gaussian prior; and second, it 
relies on a particular factorisation of the likelihood, which makes it possible to simulate sequentially the datapoints. 

The first limitation is mostly a technicality. EP-ABC, like any EP algorithm, may accommodate a non-Gaussian prior, 
by just treating the prior as an additional site. We have not described this generalisation in the paper, because we find 
more expedient in practice to simply reparametrise the model so as to make the prior Gaussian, as we did implicitly in our 
numerical examples. Since the only input required by EP-ABC from the model is an algorithm that samples a data-point, 
for a given parameter value, applying a particular one-to-one transform to the parameter is trivial. 

The second limitation requires more discussion. In this paper, we focused our attention on important applications 
of likelihood-free inference where the likelihood is trivial to factorise; either because the datapoints are independent, or 
Markov. But ABC-EP is not Umited to these two situations. First, quite a few time series models may be simulated 
sequentially, even if they are not Markov. For instance, one may apply straightforwardly EP-ABC to a GARCH-stable 
model (e.g. |Liu and Brorsen[ |1995| |Mittnik et al.[ |2000[ |Menn and Rachev[ [2005[ ) , which is a GARCH model (Bollerslevj 



1986 ) with an alpha stable-distributed innovation. Second, one may obtain a simple factorisation of the likelihood by 
incorporating latent variables into the vector d of unknown parameters. The most obvious application of this idea is 
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Fig. 5.7: Reaction time distributions conditional on decision: data versus posterior predictive distributions. The reaction 
times conditioned on contrast, position and response are shown as grey dots and summarised via mean and 
10-90% inter-quantile range (continuous hues). The posterior predictive distributions computed from samples 
are summarised and shown with an offset to the right (dotted lines). The conditional densities are well captured, 
given sufficient data. 



hidden Markov models, where the observations are conditionally independent, given the fixed parameters and the hidden 
process; see e.g. Dean et al. (20111, Calvet and Czellar ( [2011 ') for applications of likelihood free inference to this class 
of models. In such a situation, however, one needs to ensure that EP-ABC provides stable results, despite the possibly 
large dimension of 6. The marginal update outlined in the Appendix may be helpful in this context. Third, it might be 
possible in certain situations, such as repeated experiments, or in the presence of mixed effects (in the spirit of our vision 
example), to use EP-ABC in order to break the posterior down in several "pieces", and treat each piece through standard 
ABC techniques. At all rates, we believe that such extensions of EP-ABC are exciting avenues for future research. 

Finally, one may always criticise EP-ABC as being based on EP, a machine learning method which shows good 
performance empirically in many situations, but which currently lacks a proper mathematical justification, both in terms 
of convergence of the algorithm, and in terms of assessing the approximation error. Work in this direction has started 



recently ( Titterington 2011 1, but these preliminary results do not address the issue of the stability of the algorithm when 



each site update introduces a stochastic error, as in EP-ABC. This is another important direction for future research, and 
EP-ABC may well require a more technical convergence study, because of its stochastic nature. 

As of now, we would like to reply to this criticism by two remarks. First, even if EP-ABC delivers accurate results 
directly, the user is free to use EP-ABC as a first, quick step, in order to to calibrate a second, more expensive step based 
on a standard ABC approximation. Second, and this may well be the main message of this paper: it seems quite absurd 
to reject an EP-based approach, if the only alternative is an ABC approach based on summary statistics, which introduces 
a bias which seems both larger (according to our numerical examples) and more arbitrary, in the sense that in real-world 
applications, one has little intuition and even less mathematical guidance on to why p{9\s{y)) should be close to p{8\y) 
for a given set of summary statistics s. Not to mention that introducing summary statistics prevents from computing 
the evidence of the model, as we have already explained. In fact, we argue that this dependence on summary statistics is 
currently the main limitation of the ABC approach, and that it is essential that this issue is addressed in future research 
in likelihood- free inference, whether through EP or by other means. 
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Appendix: marginal EP updates 

In example 2 we make use of the fact that some sites only depend on a subset of the parameters to obtain more stable 
updates. This strategy is potentially very important for the extension of ABC-EP to Hidden Markov models suggested in 
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the discussion. We list below some results for multivariate Gaussian families that are essential in deriving these special 
EP updates. 

We generalise the problem slightly to computing the moments of a hybrid h (8) ex / {A0)J\f {0; /xq, Sq), the product 
of a multivariate Gaussian density and a likelihood which is a function of A6, where A is a matrix of dimension k x m, 
m < k. When A is a sub-matrix of the identity matrix we have the special case of a likelihood which only depends on a 
subset of the parameters. 

For the normalisation constant, we have that: 



/ / (A0) N {9; Mo, So) dO = f f (z) I f N {9; /xo, So) d0 ) dz 

J J \J{e:Ae=z} J 

= y"/(z)AA(z,AMo,ASoA*)dz (6.1) 

where we have defined a new variable z = A9. Let us denote by 7 the above integration constant. We can now derive 
an expression for the mean of the hybrid: 

7-1 /'0/(A0)AA(0;/xo,So)d0 = 7"' / / (z) ( / 0AA(0; /xo, So)d0 ) dz 

J J \J{e:Ae=z} J 

= 7-iy"/(z)AA(z,A/xo,ASoA*)^(0|z)dz (6.2) 

where E{9\z) is the conditional expectation of 9 given z. For multivariate Gaussians E{9\z) is a linear function of z: 

E{9\z) =Vz + h (6.3) 

where V = SoA* (ASoA*)"^ and b = /xo - SoA* (ASoA*)"^ A/^o. 



We inject (6.3) into m/2 



Eh{9) = 7-iy/(z)AA(z,A/xo,ASoA*)(Vz + b)dz 
= Vi^„(z)+b 
where E^ denotes expectation over the hybrid density. A similar calculation yields an expression for the covariance: 

Covh (9) == SoA* (ASoA*)"^ ASo + VCovh (z) V* (6.4) 

These three results yield computational savings and increased stability, because the moments of the hybrid distribution 
over 9 can be obtained from the moments of the marginal hybrid over z, which has lower dimensionality. 
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